Okay, then let's start by recapping what we did last time.
We were talking about sequential decision problems, more precisely Markov decision problems.
As an example, here is our 4x3 world.
For now, we assume that our environment is fully observable and that we have a finite
set of states, in this case the 11 that we see here.
But our environment is probabilistic, so which precise state we end up in after a given
action is uncertain. In this particular example, if we move in one direction, there is
a 20% chance that we accidentally walk in the wrong direction: 10% to the left and 10%
to the right, relative to whatever direction we are currently heading.
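To make that slip model concrete, here is a minimal sketch in Python (my own illustration, not code from the lecture), assuming the four actions Up, Down, Left and Right on the grid:

```python
# Hypothetical encoding of the 4x3 world's noisy movement: with probability 0.8
# the agent moves in the intended direction, with 0.1 it slips 90 degrees to the
# left and with 0.1 it slips 90 degrees to the right of that direction.

RIGHT_OF = {"Up": "Right", "Right": "Down", "Down": "Left", "Left": "Up"}
LEFT_OF = {v: k for k, v in RIGHT_OF.items()}

def action_outcomes(action):
    """Return the directions the agent may actually move in, with probabilities."""
    return [
        (action, 0.8),            # intended direction
        (LEFT_OF[action], 0.1),   # slip to the left
        (RIGHT_OF[action], 0.1),  # slip to the right
    ]

print(action_outcomes("Up"))  # [('Up', 0.8), ('Left', 0.1), ('Right', 0.1)]
```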
A Markov decision process consists therefore of the following.
We have a set of states.
For every state we have a set of actions.
We have a transition model, which is just the probability distribution over our states
given the current state and our action.
And we have a reward function, which we use in lieu of a proper utility function, but
from which we will ultimately compute a utility function.
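To make these four components concrete, here is a minimal, hypothetical Python representation (the names and types are my own choice, not from the slides):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]   # e.g. a cell (column, row) of the 4x3 grid
Action = str              # e.g. "Up", "Down", "Left", "Right"

@dataclass
class MDP:
    states: List[State]                                        # finite set of states S
    actions: Callable[[State], List[Action]]                   # A(s): actions available in s
    transition: Callable[[State, Action], Dict[State, float]]  # P(s' | s, a)
    reward: Callable[[State], float]                           # R(s)
    gamma: float = 1.0                                         # discount factor, see below
```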
The goal in a Markov decision process is to find a policy.
What is a policy?
A policy is a mapping from states to actions, and an optimal policy is one where the
action chosen in every state maximizes the expected utility.
So the goal obviously is to find an optimal policy.
Here is one example of one such policy.
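Sticking with the hypothetical representation above, a policy is nothing more than a lookup table from states to actions; the concrete entries below are made up for illustration and are not the policy from the slide:

```python
# A made-up (partial) policy for the 4x3 world: a plain mapping from states to actions.
policy = {
    (1, 1): "Up",
    (1, 2): "Up",
    (1, 3): "Right",
    (2, 3): "Right",
    (3, 3): "Right",
}

def act(state):
    """Return the action the policy prescribes in the given state."""
    return policy[state]

print(act((1, 1)))  # Up
```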
We've also seen that, depending on what rewards I pick for my states, the behavior of an
agent that executes the optimal policy will be very different.
How do we get from rewards to utility functions?
Well, the problem that we have is that we cannot properly assess utility functions a
priori.
In particular, we cannot just assess or observe utility functions from human behavior.
We've seen that before in the simple case.
Now, in the more complicated case, we have to take into account preferences over sequences
of rewards, i.e. ultimately over sequences of states.
We largely assume that these preferences are stationary, which just means that my preference
between two sequences does not depend on a shared prefix: if two sequences start with the
same state and we prefer the first over the second, then we should still prefer the first
over the second once that common initial state is removed from both.
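Spelled out in symbols (the notation is my own, not from the slides), stationarity says:

\[
[s_0, s_1, s_2, \ldots] \succ [s_0, s_1', s_2', \ldots]
\quad\Longleftrightarrow\quad
[s_1, s_2, \ldots] \succ [s_1', s_2', \ldots]
\]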
For stationary preferences, there are ultimately only two ways to get to a utility function.
Either I simply add the rewards up, or, more generally, I add them up with discounting,
i.e. I introduce some discount factor gamma between 0 and 1 and weight the reward at each
step of the sequence by the corresponding power of gamma; plain addition is just the special
case gamma = 1.
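Written as a formula (again in my own notation), the discounted case reads

\[
U([s_0, s_1, s_2, \ldots]) \;=\; R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \;=\; \sum_{t \ge 0} \gamma^t R(s_t),
\]

and plain additive rewards are simply the special case \(\gamma = 1\).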
Discounting is one way to solve the problem we run into with infinite lifetimes: if I have
to add up infinitely long sequences of states and all of their rewards, I cannot in general
accumulate any fixed, finite utility. One possible solution is to just say we cut off
computation at some fixed time t. That's annoying, because suddenly my policy needs to take
the remaining time into account when deciding how to act.
Alternative two: at some point my agent necessarily dies anyway, no matter which policy I
execute. And of course there is variant number three, which is the one we are largely going
to deal with: we introduce a discount factor. In that case, the sum over all expected
rewards actually converges, because it is bounded by a geometric series.
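The convergence argument, written out under the assumption that all rewards are bounded by some \(R_{\max}\) (which certainly holds for a finite state set), is just the geometric series:

\[
\Bigl|\sum_{t=0}^{\infty} \gamma^t R(s_t)\Bigr|
\;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma}
\qquad \text{for } 0 \le \gamma < 1.
\]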
Here is a list of advantages of discounted rewards. We've talked about those. Once we've